Preface

The goal of this week assignment is to practice basic tools available in R for developing linear regression models with one or more variables, conduct visual and quantitative evaluation of their relative performance and reason about associated tradeoffs. We will continue working with abalone dataset (that you have already downloaded and used for the previous week assignment) and will use some of the variables available there to develop model of snail age. Given the simplicity of the measurements available in this dataset (essentially just dimensions and masses of various compartments of the mollusc) and potential variability in growth rates due to differences in environmental conditions (e.g. location, temperature, nutrients, etc.) that are not captured in this dataset, we should expect substantial fraction of variability in abalone age to remain unexplained as part of this exercise. Furthermore, given strong correlations between some of the predictors in this dataset it is possible that only a small number of those could be justifiably used in the model (for the reasons related to collinearity - see Ch.3.3.3 section 6 of ISLR).

Here an uninspiring example of the model of shell length and diameter is used to illustrate R tools that will be needed for this assignment. Please note that by this time abaDat dataset has been already created and corresponding columns have been named len and diam respectively – the variables names in your code likely will be different. Then a simple linear model can be fit using function lm() and summarized using summary:

summary(lm(len~diam,abaDat))
## 
## Call:
## lm(formula = len ~ diam, data = abaDat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.299726 -0.010843  0.000032  0.010515  0.137215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.036913   0.001273   29.01   <2e-16 ***
## diam        1.194168   0.003032  393.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01944 on 4175 degrees of freedom
## Multiple R-squared:  0.9738, Adjusted R-squared:  0.9738 
## F-statistic: 1.552e+05 on 1 and 4175 DF,  p-value: < 2.2e-16

The plot of predictor and response with regression line added to it can be generated using standard R functions plot and abline:

plot(abaDat[,c("diam","len")])
abline(lm(len~diam,abaDat))

Diagnostic plots for this model can be obtained also by the call to plot with lm() result as input:

old.par <- par(mfrow=c(2,2))
plot(lm(len~diam,abaDat))

par(old.par)

R functions confint returns confidence intervals for model parameters and predict (with appropriate parameters) returns model predictions for the new data and corresponding estimates of uncertainty associated with them:

confint(lm(len~diam,abaDat))
##                  2.5 %     97.5 %
## (Intercept) 0.03441834 0.03940834
## diam        1.18822442 1.20011168
predict(lm(len~diam,abaDat),newdata=data.frame(diam=c(0.2,0.3,0.4,0.5)),interval='confidence')
##         fit       lwr       upr
## 1 0.2757469 0.2743778 0.2771161
## 2 0.3951638 0.3942926 0.3960349
## 3 0.5145806 0.5139889 0.5151722
## 4 0.6339974 0.6331926 0.6348021
predict(lm(len~diam,abaDat),newdata=data.frame(diam=c(0.2,0.3,0.4,0.5)),interval='prediction')
##         fit       lwr       upr
## 1 0.2757469 0.2376054 0.3138885
## 2 0.3951638 0.3570369 0.4332906
## 3 0.5145806 0.4764590 0.5527021
## 4 0.6339974 0.5958719 0.6721228

Problem 1: model of age and shell weight (30 points)

Here we will identify variable most correlated with the outcome (abalone age), build simple linear model of snail age (rings+1.5 as per dataset description) as function of this variable, evaluate model summary and diagnostic plots and assess impact of using log-transformed (instead of untransformed) attributes on the model peformance. The following steps provide approximate outline of tasks for achieving these goals:

  1. Calculate correlations between all continuous attributes in this dataset. Given potential non-linear relationship between some of the attributes and snail age, it might be prudent to use both Pearson and Spearman correlations to determine which variable is most robustly correlated with age.

  2. Fit linear model of age as outcome and shell weight as predictor using R function lm, display the result using summary function, use its output to answer the following questions:

  1. Create scatterplot of age and shell weight and add regression line from the model to the plot using abline function

  2. Create diagnostic plots of the model and comment on any irregularities that they present. For instance, does plot of residuals vs. fitted values suggest presence of non-linearity that remained unexplained by the model? How does it compare to the plot of the predictor and outcome with regression line added to it that was generated above?

  3. Use function confint to obtain confidence intervals on model parameters

  4. Use this model and predict function to make predictions for shell weight values of 0.1, 0.2 and 0.3. Use confidence and prediction settings for parameter interval in the call to predict to obtain confidence and prediction intervals on these model predictions. Explain the differences between interpretation of:
    • confidence intervals on model parameters and model predictions
    • confidence and prediction intervals on model predictions
    • Comment on whether confidence or predicion intervals (on predictions) are wider and why

Answer

abalone <- read.table("abalone.data", sep = ",")
colnames(abalone) <- c("sex","length","diameter","height","whole_weight","shucked_weight","viscera_weight","shell_weight","rings")
  1. Correlation
cor(abalone$length, abalone$rings, method = "pearson")
## [1] 0.5567196
cor(abalone$diameter, abalone$rings, method = "pearson")
## [1] 0.5746599
cor(abalone$height, abalone$rings, method = "pearson")
## [1] 0.5574673
cor(abalone$whole_weight, abalone$rings, method = "pearson")
## [1] 0.5403897
cor(abalone$shucked_weight, abalone$rings, method = "pearson")
## [1] 0.4208837
cor(abalone$viscera_weight, abalone$rings, method = "pearson")
## [1] 0.5038192
cor(abalone$shell_weight, abalone$rings, method = "pearson")
## [1] 0.627574
cor(abalone$length, abalone$rings, method = "spearman")
## [1] 0.6043853
cor(abalone$diameter, abalone$rings, method = "spearman")
## [1] 0.622895
cor(abalone$height, abalone$rings, method = "spearman")
## [1] 0.6577164
cor(abalone$whole_weight, abalone$rings, method = "spearman")
## [1] 0.630832
cor(abalone$shucked_weight, abalone$rings, method = "spearman")
## [1] 0.53942
cor(abalone$viscera_weight, abalone$rings, method = "spearman")
## [1] 0.6143438
cor(abalone$shell_weight, abalone$rings, method = "spearman")
## [1] 0.6924746

shell weight is most robustly correlated with age

  1. Fit model
model <- lm(rings~shell_weight, abalone)
summary(model)
## 
## Call:
## lm(formula = rings ~ shell_weight, data = abalone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9830 -1.6005 -0.5843  0.9390 15.6334 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.46212    0.07715   83.76   <2e-16 ***
## shell_weight 14.53568    0.27908   52.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.51 on 4175 degrees of freedom
## Multiple R-squared:  0.3938, Adjusted R-squared:  0.3937 
## F-statistic:  2713 on 1 and 4175 DF,  p-value: < 2.2e-16

Yes, there is significant association between shell_weight and age.

RSE=2.51, \(R^2\)=0.39

beta0=6.46, beta1=14.53. Age changes 14.53 units in the same direction as shell_weight changes 1 unit. The intercept is the age predicted for the abalones with 0 shell weight. It is a theoretical prediction and doesn’t have actual sense.

  1. Scatterplot
plot(abalone$shell_weight, abalone$rings)
abline(model, col="red")

  1. Diagnostic
plot(model)

According to residuals vs. fitted values plot, there is non-linearity remain unexplained. Analyzed together with the abline plot above, it can be suggested that transforming age to log(age) might be better.

According to Q-Q plot, the distribution is not normal.

  1. Confidence Interval
confint(model)
##                  2.5 %    97.5 %
## (Intercept)   6.310869  6.613365
## shell_weight 13.988525 15.082825
  1. Predict
predict(model,newdata=data.frame(shell_weight=c(0.1,0.2,0.3)),interval='prediction')
##         fit      lwr      upr
## 1  7.915684 2.992593 12.83878
## 2  9.369252 4.446701 14.29180
## 3 10.822819 5.900201 15.74544
predict(model,newdata=data.frame(shell_weight=c(0.1,0.2,0.3)),interval='confidence')
##         fit       lwr       upr
## 1  7.915684  7.808121  8.023247
## 2  9.369252  9.290188  9.448315
## 3 10.822819 10.739634 10.906005

Problem 2: model using log-transformed attributes (20 points)

  1. Use lm() to fit a regression model of log-transformed age as linear function of log-transformed shell weight and use summary to evaluate its results. Can we compare fits obtained from using untransformed (above) and log-transformed attributes? Can we directly compare RSE from these two models? What about comparing \(R^2\)? What would we conclude from this? (Please consult ISLR Ch.3.1.3 if unsure) What would be the physical meaning of model coefficients this time? What does model intercept represent in this case, for example? How sensible is this and how does it compare to that from the fit on untransformed data?

  2. Create a XY-scatterplot of log-transformed predictor and response and add corresponding regression line to it. Compared it to the same plot but in untransformed coordinates obtained above. What would you conclude from such comparison?

  3. Make diagnostic plots for model fit on log-transformed age and shell weight. Compare their appearance to that for the model using original scale of measurements. What would you conclude from this comparison about their relative quality?

Answer 1. lm fit

model2 <- lm(log(rings)~log(shell_weight), data=abalone)
summary(model2)
## 
## Call:
## lm(formula = log(rings) ~ log(shell_weight), data = abalone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83898 -0.14717 -0.04113  0.11985  0.85156 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.732532   0.007622   358.5   <2e-16 ***
## log(shell_weight) 0.291213   0.004101    71.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2151 on 4175 degrees of freedom
## Multiple R-squared:  0.547,  Adjusted R-squared:  0.5469 
## F-statistic:  5041 on 1 and 4175 DF,  p-value: < 2.2e-16

RSE is now 0.21 vs 2.51. \(R^2\) is now 0.54 vs 0.39. The prediction error is significantly smaller than untransformed model. More variance is explained (higher \(R^2\)).

The coefficients mean that 1 unit change of log(shell_weight) bring 0.29 unit change of log(rings). The intercept is the predicted log(rings) at the point where log(shell_weight)=0, i.e. shell_weight=1. Now it has actual meaning, vs. in untransfomed data, where it doesn’t.

  1. Scatterplot
plot(log(abalone$shell_weight), log(abalone$rings))
abline(model2, col="red")

In the transformed plot, the abline aligns to the scatterplot better.

  1. Diagnostic Plots
plot(model2)

All 4 plots looks more like a good fit. Residual vs Fitted plot doesn’t show significant evidence of non-linearity, and Q-Q plot indicates that the distribution is close to normal now.

Problem 3: Adding second variable to the model (10 points)

To explore effects of adding another variable to the model, continue using log-transformed attributes and fit a model of log-transformed age as a function of shell weight and shucked weight (both log-transformed also). Just an additive model – no interaction term is necessary at this point. Please obtain and evaluate the summary of this model fit, confidence intervals on its parameters and its diagnostic plots. Where applicable, compare them to the model obtained above and reflect on pros and cons of including shucked weight as another variable into the model.

Answer

model3 <- lm(log(rings)~log(shell_weight)+log(shucked_weight), data = abalone)
summary(model3)
## 
## Call:
## lm(formula = log(rings) ~ log(shell_weight) + log(shucked_weight), 
##     data = abalone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36610 -0.13074 -0.02342  0.10704  0.85220 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.866848   0.008535  335.90   <2e-16 ***
## log(shell_weight)    0.628341   0.012788   49.13   <2e-16 ***
## log(shucked_weight) -0.332552   0.012053  -27.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1978 on 4174 degrees of freedom
## Multiple R-squared:  0.6169, Adjusted R-squared:  0.6167 
## F-statistic:  3360 on 2 and 4174 DF,  p-value: < 2.2e-16
confint(model3)
##                          2.5 %     97.5 %
## (Intercept)          2.8501151  2.8835806
## log(shell_weight)    0.6032695  0.6534125
## log(shucked_weight) -0.3561826 -0.3089213
plot(model3)

It now gives us a better fit, RSE=0.20 vs 0.21, \(R^2\)=0.61 vs 0.54. The Residuals vs Fitted plot shows even less non-linearity. There are less points with high leverage.